DOCUMENT RESUME

ED 093 Oia CS 500 749 



AUTHOR          Buley, Jerry L.
TITLE           Criterion Referenced Measurement in Speech-Communication Classrooms: Panacea for Mediocrity. Research Report.
INSTITUTION     Arizona State Univ., Tempe. Communication Research Center.
PUB DATE        Apr 74
NOTE            16p.; Paper presented at the Annual Meeting of the Central States Speech Communication Association (Milwaukee, Wisconsin, April 4-6, 1974); Some pages may reproduce poorly



EDRS PRICE      MF-$0.75 HC-$1.50 PLUS POSTAGE
DESCRIPTORS     *Communication (Thought Transfer); *Criterion Referenced Tests; *Educational Research; Higher Education; Measurement Instruments; *Measurement Techniques; Norm Referenced Tests; *Speech Instruction



ABSTRACT 

The philosophical underpinnings of the typical testing practices of speech communication teachers in regard to norm-referenced measurement contain several assumptions which teachers may find untenable on closer inspection. Some of the consequences of these assumptions are a waste of human potential, inefficient use of instructional expertise, development of negative attitudes toward school and self, and creation of mental health problems in a significant number of students. Criterion-referenced measurement was developed in response to the weaknesses of norm-referenced measurement, and the assumptions of both types of measurement receive critical attention. (Author/RB)






CRITERION-REFERENCED MEASUREMENT
IN SPEECH COMMUNICATION CLASSROOMS:
PANACEA FOR MEDIOCRITY

by Jerry L. Buley 

Abstract

The philosophical underpinnings of the typical testing practices (i.e., norm-referenced measurement) of Speech Communication teachers contain several assumptions which the same teachers may find untenable upon closer inspection. Some of the consequences of these assumptions are a waste of human potential, inefficient use of instructional expertise, development of negative attitudes toward school and self, and creation of mental health problems in a significant number of students.

Criterion-referenced measurement was developed in response to the weaknesses of norm-referenced measurement, and used correctly should alleviate many of the negative consequences of the latter.

This paper investigates the assumptions of both types of measurement and discusses their implications for instruction in the speech communication classroom.



CRITERION-REFERENCED MEASUREMENT IN SPEECH
COMMUNICATION CLASSROOMS: PANACEA FOR MEDIOCRITY

by Jerry Buley 
Arizona State University 

Introduction 

As in everything man does, a compromise must be made between the best job he can possibly do and what is feasible. It is no different in the design and administration of tests. The methodologists scream at us from one side that, because the results of tests are used to make critical decisions, the tests must be designed carefully and that we must be sure they are valid and reliable. On the other side we have all the responsibilities of teaching which eat into our time. The result is that we do the best we can do with what we have.

We can make sure that we are using the best teaching methodology and measurement techniques which are consistent with the time we have available to employ them. The purpose of this paper is (1) to describe the current measurement practice, (2) to present and criticize the assumptions underlying that practice, (3) to present an alternative measurement practice which would seem to meet these criticisms, and finally (4) to discuss methods for evaluating tests in which the alternative measurement practice has been used.

Current Measurement Practice 

Usually what that means is that we spend several hours developing the best test items we can. Then we administer the items to our students. If we have time and the appropriate facilities, we may perform an item analysis to find the relative discrimination and level of difficulty for each item. Discrimination, as you may know, is a measure of how well the question differentiates between those who perform well on the test and those who do not perform well. The difficulty of an item is simply the proportion of students getting the item wrong.
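
The two item statistics just described can be sketched in a few lines of Python. This is only an illustration: the response data are invented, the function names are hypothetical, and discrimination is computed here as the difference in proportion correct between the upper and lower halves of the class, which is one common index and not necessarily the one a given item-analysis service would report. Difficulty follows the definition given above (proportion answering wrong).

```python
def item_difficulty(item_responses):
    """Proportion of students who got the item wrong (the definition used in the text)."""
    return item_responses.count(0) / len(item_responses)


def item_discrimination(item_responses, total_scores):
    """Upper-minus-lower index (an assumed choice): proportion correct in the
    top half of the class, ranked by total score, minus the proportion
    correct in the bottom half."""
    order = sorted(range(len(total_scores)), key=lambda s: total_scores[s], reverse=True)
    half = len(order) // 2
    upper, lower = order[:half], order[-half:]
    p_upper = sum(item_responses[s] for s in upper) / half
    p_lower = sum(item_responses[s] for s in lower) / half
    return p_upper - p_lower


# Invented data: rows are students, columns are items; 1 = correct, 0 = wrong.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
totals = [sum(row) for row in responses]
items = [list(col) for col in zip(*responses)]  # one list of responses per item

difficulty = [item_difficulty(col) for col in items]
discrimination = [item_discrimination(col, totals) for col in items]
```

In this invented class the first item is answered correctly by everyone, so its discrimination is zero; it is exactly the kind of item that would be discarded when the goal is to spread scores out.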

Usually we use the results of an item analysis to select items for the same test when it is administered the next time to another group of students.

After the students take the test we assign grades by looking at the distribution of scores on the test. Scores at the top of the distribution become A's, those between the top and the middle become B's, those at the middle become C's, and so on.

What I have described is norm-referenced measurement. That is, each student's score is compared to all scores to determine how well he has performed relative to all other students who took the same test, or, the norm.

Assumptions and Criticisms of Current
Measurement Practice

Assumption I: Test should have a broad distribution
of scores

Instructors who use norm-referenced measurement are primarily interested in increasing the variability in the scores on their tests. For example, they may throw out questions which do not discriminate between those who perform poorly and those who perform quite well. Also, they will include questions which range in difficulty from quite easy to quite difficult to make sure some people will get a few correct and few will get them all correct. An argument frequently used to defend this practice is that human behavior tends to be normally distributed along a single dimension. Therefore, it is logical to expect that, since test scores are examples of human behavior, they will be normally distributed.

A criticism of this argument is that very few teacher-made tests are unidimensional, nor usually are they intended to be. Usually, a test covers a whole gamut of concepts and ideas even when only over a single unit of instruction. Some students may perform better than others on some tasks but worse than the others on other tasks. The result of this could be to decrease the variability of scores or produce a very skewed distribution. Thus, there is no necessary reason to believe that the distribution of scores will conform to a normal curve.

A second criticism of this argument is that it assumes that scores, even following instruction, must be normally distributed. If we were training students how to talk very sexy, we might tell them they must learn to talk slowly, with low volume, lazy enunciation, with a lower pitch, and with a lot of breathiness. We might coach them over a period of time and then put them on an audio tape and grade them. We would expect that nearly all of the students will perform most of the behaviors correctly. In other words, the distribution of scores would be very negatively skewed after instruction. In fact, the better adapted our coaching is to the individual student's weaknesses, the less variability there will be in the results of measurements. Thus, to assume a normal distribution is to assume that instruction has been designed to greatly benefit only a minority of the students.

A second argument used by instructors to defend their emphasis on variability in scores is that current statistical techniques for obtaining estimates of reliability and validity of tests require that the results contain some variability. If variability is low, i.e., there is little difference between high and low scores, we are unable to accurately estimate the reliability of our test.

We can often hear teachers saying that because everyone received a high score on a particular test the test must have been bad, or at the least that it was too easy. If the test cannot discriminate between the good and poor students and if a reliability coefficient cannot be derived, then it must be a bad test.

It is possible that a test may be absolutely reliable and valid in every sense of both terms but receive a spuriously low reliability coefficient because there was little variability in the scores. You and I both would agree that to throw out a perfectly good test because of an artifact of the statistical technique used to evaluate it would be absurd.
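
The artifact can be made concrete with a small computation. The sketch below uses the Kuder-Richardson formula 20 (KR-20) as a stand-in for a conventional reliability coefficient; the text does not say which coefficient is meant, so that choice, the function name, and all three data sets are assumptions made purely for illustration.

```python
def kr20(responses):
    """KR-20 for a table of 0/1 responses (rows = students, columns = items):
    (k / (k - 1)) * (1 - sum(p_i * q_i) / variance of total scores).
    Returns None when the total scores have no variance at all."""
    n, k = len(responses), len(responses[0])
    totals = [sum(row) for row in responses]
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / n
    if variance == 0:
        return None  # coefficient is undefined without variability
    sum_pq = 0.0
    for col in zip(*responses):
        p = sum(col) / n
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / variance)


# Invented class with widely spread scores (totals 0 through 5 on 5 items).
spread = [[1 if item < score else 0 for item in range(5)] for score in range(6)]

# Invented class after effective instruction: everyone scores 4 or 5.
mastery = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 1, 1],
]

# Invented class in which every student answers every item correctly.
all_correct = [[1, 1, 1, 1, 1] for _ in range(6)]

r_spread = kr20(spread)        # high coefficient
r_mastery = kr20(mastery)      # low (here negative) coefficient
r_perfect = kr20(all_correct)  # None: cannot be computed
```

The low value for the second class comes from the small variance of the total scores relative to the item variances. Under these assumptions it says nothing about whether the items themselves are sound.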



Many teachers who use norm-referenced measurement believe, or act as though they believe, that their tests are only measures of student accomplishment. For example, if a class of students has an average number of correct responses of 30 on a 60-item test, such an instructor is very likely to say that this was a poor class. As anyone who has had a taste of measurement can tell us, any test score contains at least three components:

A. Student achievement

B. Teacher performance

C. Measurement error

We have methods for estimating the amount of measurement error in a test (Dick & Hagerty, 1971). We do not have much to help us separate out the component associated with teacher performance and/or the component associated with student achievement. However, it is much more probable that one person's behavior is at deviance than it is that the collective behavior of 30 people is at deviance. Thus, the laws of probability would suggest that, when the average score on a test is far below the possible score, the teacher's performance may have been at fault. Usually this information is lost in the assignment of grades because the highest score gets an A despite the fact that it represents only 40 correct responses on a 60-item test. Should a 40 get an A one semester on the same test that it takes a 58 to get an A in another semester because the teacher taught better?
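
The loss of information can be illustrated with a deliberately simple curve. The cut points below (top fifth A, next fifth B, and so on) are hypothetical, as are the function name and both sets of scores; the text says only that grades follow position in the distribution.

```python
def curve_grades(scores):
    """Assign letter grades by rank alone: the top fifth of the class gets an A,
    the next fifth a B, and so on (hypothetical cut points). Ties are broken
    by position in the list, which is adequate for this sketch."""
    letters = "ABCDF"
    n = len(scores)
    ranked = sorted(range(n), key=lambda s: scores[s], reverse=True)
    grades = [None] * n
    for rank, student in enumerate(ranked):
        grades[student] = letters[rank * len(letters) // n]
    return grades


# Invented scores on the same 60-item test in two different semesters.
semester_one = [40, 36, 33, 30, 28, 27, 25, 24, 22, 20]
semester_two = [58, 55, 53, 50, 48, 47, 45, 44, 42, 40]

grades_one = curve_grades(semester_one)
grades_two = curve_grades(semester_two)

# The same raw score of 40 earns a different grade in each semester.
grade_for_40_first = grades_one[semester_one.index(40)]
grade_for_40_second = grades_two[semester_two.index(40)]
```

Once only the letters are recorded, nothing in the grade book shows that the first class averaged far fewer correct responses than the second.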




Assumption III: Competition produces motivation

For hundreds of years teachers have assumed that competition increases motivation for grades. Norm-referenced measurement, because it compares students to other students, is assumed to produce competition between the students for grades. Benjamin Bloom (1971), in an article entitled "Affective Consequences of School Achievement," has taken norm-referenced measurement to task on this very point.

Bloom (1971, pp. 13-15) suggests that our schools actually have two curricula. One is an explicit curriculum which is the formal content the student is expected to learn. The other is an implicit curriculum which teaches the student who he is in relation to others. While he may learn the latter slowly, its effect is cumulative over a 7- to 12-year period in school. Thus it is not something he will easily forget.

Throughout his progression through the levels of public school and college, the student is constantly compared to the other students. Nowhere else in his life is he judged so frequently by others and in such precise terms as he is in school. The majority of workers, for example, are expected to meet some minimal standard of work, usually quite low.

Bloom presents evidence (1971, pp. 15-26) which shows
that two-thirds of our students acquire a non-positive or even a very negative attitude toward school, learning, and academics in general (e.g., Russell, 1969; Michael, et al., 1964; and Kahn, 1969).

Bloom also cites evidence which shows that about two-thirds of our students acquire a negative self-concept as a direct result of always having someone performing better than they are performing (Torshen, 1969). Finally, Bloom presents other evidence to the effect that there is a relationship between the use of present measurement practices and mental health. This is probably more true for the bottom third of the classes than for the other two thirds (Glidewell, 1967; Torshen, 1969; and Bower, 1962).

The result of the use of norm-referenced measurement to produce motivation to learn is that it creates an enormous waste of human potential. The system is geared to produce low self-concepts, negative attitudes toward academia, and even may be linked to mental dysfunctions.

At the same time,* the use of norm-referenced measurement creates an inefficient use of instructional expertise. Because some students enter an instructional unit without the knowledge or skills necessary to learn the content of that unit, the teacher must try to bring the weak students up to the required level. This is a waste of time for the students who already have the required skills or knowledge. It also prevents the instructor from presenting the instructional unit in the most efficient manner, resulting in a waste of money and teacher manpower.

* Although Bloom did not broach the subject, this may be a significant reason why school bond elections have wor...

In summary, many if not all of the assumptions of norm-referenced measurement may be untenable upon closer inspection. In addition, there are affective, mental health, and economic disadvantages to the current measurement practice.

Criterion-Referenced Measurement: An Alternative

Criterion-referenced measurement is not really a new form of measurement. It bears some resemblance to the use of percentages to evaluate student performance. That is, 90% correct is an A, 80 to 89 is a B, and so on. The point is that the student is no longer compared to other students. Instead, the student is compared to a standard or criterion chosen by the instructor before the instructional unit begins.

The instructor who employs criterion-referenced measurement is primarily interested in testing as a feedback system rather than as a method for differentiating among students. It is a dichotomous system. Either the student has met the criterion, or he has not. When the measurement procedure is used in conjunction with mastery learning (Block, 1971), the instructor uses the information from testing to locate the areas in which the student is weak and then focuses on them to bring the student up to the criterion level of performance.
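
As a feedback system, the procedure amounts to a pass/fail check against a criterion fixed in advance, followed by a list of the areas that still need work. The sketch below assumes an 80% criterion and a test whose items are grouped by objective; the objective names, the scores, and the function name are all invented for illustration.

```python
CRITERION = 0.80  # assumed criterion level, fixed before instruction begins


def weak_areas(correct_by_objective, items_by_objective, criterion=CRITERION):
    """Return the objectives on which the student fell short of the criterion."""
    return [
        objective
        for objective, n_items in items_by_objective.items()
        if correct_by_objective.get(objective, 0) / n_items < criterion
    ]


# Hypothetical unit test: number of items written for each objective.
items_by_objective = {"rate": 5, "volume": 5, "pitch": 5, "enunciation": 5}

# One student's number of correct items per objective (invented).
student = {"rate": 5, "volume": 4, "pitch": 2, "enunciation": 3}

needs_work = weak_areas(student, items_by_objective)
met_criterion = not needs_work  # dichotomous outcome: met on every objective, or not
```

No other student's score enters the calculation, which is the essential difference from the norm-referenced procedure.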

It is possible that all students in a class will get all correct on a criterion-referenced test. In fact, the aim of criterion-referenced measurement is that the majority (or perhaps even all) of the students achieve a particular criterion level of performance (typically 80%).

As you can see, the procedure places the responsibility for grades on the teacher rather than on the student. When used in conjunction with some form of mastery learning, the student can experience success after success in the learning environment. This should alleviate many, if not all, of the affective consequences of norm-referenced measurement.

If all students meet the criterion in one instructional unit, then they should have the same entry level of skills and/or knowledge for the next instructional unit. The instructor in the next instructional unit would not have to spend time bringing some of the students up to the required level of performance. Therefore, there should be more efficient use of instructional expertise.

Criterion-referenced measurement meets many, if not all, of the criticisms raised against norm-referenced measurement. However, there is a problem with criterion-referenced measurement. This is the problem of obtaining some measure of reliability.

Since criterion-referenced measurement can, and probably will, have very little variability in the scores, it is not possible to use present statistical techniques to derive an estimate of reliability. I have developed a technique which may get around this that I would like to present.



A Method for Evaluating Criterion-Referenced Measurement 

Reliability coefficients are based on an estimate of measurement error, which is a simple concept in psychometrics.* It is the expression of the difference between actual reality and the image of reality produced by our measurement (Dick & Hagerty, 1971, p. 10; Ferguson, 1971, p. 362).

Since it is impossible to know what reality is except by some measure of reality, psychometricians have traditionally taken two measures of the same reality and then compared them. To the extent that the two measures provide the same result, the measure is said to be reliable. The extent to which they do not provide the same result is called the unreliability or measurement error in the measure.

The psychometrician usually does not even obtain two measures of reality and compare them. What he does is to obtain one measure from each student and then find the mean of all these. Since the difference between the mean and the mean of all error in a test is assumed to be zero,** the average difference between the students' scores and the mean for the test may be assumed to be the basis for an estimate of measurement error and ultimately reliability.

* The following discussion is based in large part on a paper I presented to the Purdue Doctoral Honors Seminar, 1973.

** That is, error is randomly distributed and thus would have a mean that is neither above nor below the mean of the scores.






This exercise in tangential thinking has heretofore been the primary method for obtaining a measure of reliability of a test. If there is no variability in the test scores, then it is next to worthless. I suggest that we need to go back to the original conceptualization of error in measurement and obtain two measures of reality.

A way to get these two estimates of reality is to ask 
the student to evaluate the accuracy of his own response and 
then compare that with the instructor's evaluation of the 
student's response. The deviation between the instructor's 
evaluation and the student's evaluation is thus another esti- 
mate of error . 

Any given response from a student can be evaluated by 
the instructor as either right or wrong. The student can per- 
ceive his response to be either right or wrong. Thus we have 
the following two by two matrix: 

                         Instructor's Evaluation
                         of Student's Response

                           RIGHT       WRONG

Student's       WRONG        1           2
Evaluation
of Own
Response        RIGHT        3           4



When the student and the instructor agree on the correctness of the student response (Cells 2 and 3), we can say there is no error either in the measurement device or in the instructional unit. However, when the instructor and the student disagree as to the correctness of the response (Cells 1 and 4), we know there is error somewhere. Either the measuring instrument has led the student astray, or the instructor has taught something other than what he thought he had taught.

In order to obtain this data, all that need be done is simply to ask the student to respond twice to each item on a test. The first response is his answer to the question. The second response is his evaluation of his own response.
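
Collecting the data this way yields, for every response, one instructor judgment and one student judgment, which together pick out a cell of the matrix. A minimal sketch of that bookkeeping follows. The function name and the sample responses are hypothetical, and the cell numbering follows the matrix above (Cell 1: instructor right, student wrong; Cell 2: both wrong; Cell 3: both right; Cell 4: instructor wrong, student right).

```python
from collections import Counter


def cell(instructor_right, student_right):
    """Map one response to its cell in the two-by-two matrix."""
    if student_right:
        return 3 if instructor_right else 4
    return 1 if instructor_right else 2


# Invented responses for one student on a five-item test:
# (instructor marked it right?, student believed it right?)
responses = [
    (True, True),
    (True, False),
    (False, True),
    (False, False),
    (True, True),
]

cells = [cell(i, s) for i, s in responses]
counts = Counter(cells)

agreements = counts[2] + counts[3]     # student and instructor agree
disagreements = counts[1] + counts[4]  # error somewhere
```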

There are several interesting things which might be done with this data. First, the instructor could sum the frequency of occurrence of Cells 1 and 4 across all subjects for each item. This would tell him which items contain or measure the most error.

Also, the instructor could sum the frequency of occurrence of Cells 1 and 4 across all items and all students and divide by the number of students times the number of items. This would produce an estimate of the total amount of measurement error in the test.
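
The item-level tally and the overall proportion described in the two preceding paragraphs can be sketched as follows. The table of cell numbers is invented, and the variable and function names are hypothetical.

```python
# Invented data: cell number (1-4) for each student (row) on each item (column).
cells = [
    [3, 1, 3, 2],
    [3, 4, 3, 3],
    [3, 4, 2, 3],
]


def is_error(c):
    """Cells 1 and 4 are the disagreement (error) cells."""
    return c in (1, 4)


n_students, n_items = len(cells), len(cells[0])

# Error count for each item, summed across all students.
errors_per_item = [sum(is_error(row[j]) for row in cells) for j in range(n_items)]

# Overall proportion of error: disagreements / (students x items).
total_error = sum(errors_per_item) / (n_students * n_items)

# Index of the item on which students and instructor disagree most often.
worst_item = errors_per_item.index(max(errors_per_item))
```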

Another interesting use might be to sum the frequency of Cells 1 and 4 across all items for each student to find the students who are the most in error (i.e., those who deviate most frequently from the instructor's evaluation of the correctness of a response).
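
The same tally can be taken by student instead of by item. In this sketch the student labels and cell numbers are invented.

```python
# Invented data: each student's cell numbers (1-4) across a four-item test.
cells_by_student = {
    "student_a": [3, 1, 3, 2],
    "student_b": [3, 4, 1, 3],
    "student_c": [3, 3, 2, 3],
}

# Count Cells 1 and 4 (disagreements with the instructor) for each student.
errors_per_student = {
    name: sum(c in (1, 4) for c in row) for name, row in cells_by_student.items()
}

# The student who deviates most often from the instructor's evaluation.
most_in_error = max(errors_per_student, key=errors_per_student.get)
```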




Finally, the instructor might administer the test at the beginning and the end of the instructional unit and look at the changes in the frequencies in the cells of the matrix. One would expect that the frequencies in Cells 1 and 4 would go down (error) and the frequencies in Cells 2 and 3 would go up (non-error).

If the instructor finds, for example, that the frequency of Cell 4 increases between the two administrations, he has definite evidence that he was teaching the wrong behavior as being correct.
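
A before-and-after comparison of the cell frequencies might be tabulated as below. The pretest and posttest cell numbers are invented, pooled over all students and items, and the function name is hypothetical.

```python
from collections import Counter


def cell_frequencies(cells):
    """Count how often each of the four cells occurs in a flat list of cell numbers."""
    counts = Counter(cells)
    return {c: counts.get(c, 0) for c in (1, 2, 3, 4)}


# Invented cell numbers from the two administrations of the same test.
pretest = [1, 1, 1, 2, 2, 3, 3, 3, 4, 2]
posttest = [1, 2, 3, 3, 3, 3, 3, 3, 4, 4]

before = cell_frequencies(pretest)
after = cell_frequencies(posttest)
change = {c: after[c] - before[c] for c in (1, 2, 3, 4)}

error_change = change[1] + change[4]      # expected to be negative
agreement_change = change[2] + change[3]  # expected to be positive

# A rise in Cell 4 means more students now believe a response is right
# that the instructor scores as wrong.
cell_4_rose = change[4] > 0
```

In this invented case overall error falls, yet Cell 4 still rises, which is the pattern the text treats as a warning about what was actually taught.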

While the ideas I have presented above seem conceptually sound, I have not conducted any definitive research to verify that they are sound. I do use the technique in all of my own testing, however, and I have found it to be very valuable.

The assumptions of the traditionally used norm-referenced measurement were found to be untenable after examination. Further, the use of norm-referenced measurement in Speech Communication classrooms may be associated with mediocrity or worse.

Criterion-referenced measurement promises a panacea, especially if it is used in conjunction with mastery learning. The major weakness of criterion-referenced measurement is in the difficulty of obtaining an estimate of reliability. An innovative method for accomplishing this is to compare student and instructor evaluations of the correctness of the student's response.



References

Block, J. H. (ed.) Mastery Learning: Theory and Practice.
New York: Holt, Rinehart and Winston, 1971.

Bloom, B. S. Affective Consequences of School Achievement.
In Block, J. H. (ed.) Mastery Learning: Theory and
Practice. New York: Holt, Rinehart and Winston, 1971.

Bower, E. M. Mental Health in Education. Review of Educational
Research, 1962, 32, 441-454.

Buley, J. L. Measurement Error in Criterion-Referenced
Measurement. Paper presented at the Purdue Doctoral
Honors Seminar on Trends and Issues in Communication
Education, February, 1973.

Dick, W., & Hagerty, N. Topics in Measurement: Reliability
and Validity. New York: McGraw-Hill, 1971.

Ferguson, G. A. Statistical Analysis in Psychology and Education.
New York: McGraw-Hill, 1971.

Kahn, S. B. Affective Correlates of Academic Achievement.
Journal of Educational Psychology, 1969, 60, 216-221.

Michael, W. B., Baker, C., & Jones, R. A. A Note Concerning
the Predictive Validities of Selected Cognitive and Non-
Cognitive Measures for Freshmen Students in a Liberal
Arts College. Educational and Psychological Measurement,
1964, 24, 373-377.

Russell, I. L. Motivation for School Achievement: Measurement
and Validation. Journal of Educational Research,
1969, 62, 263-266.

Stringer, L. A., & Glidewell, J. C. Early Detection of Emotional
Illnesses in School Children. Final Report.
St. Louis, Mo.: St. Louis County Health Department,
1967.

Torshen, K. The Relation of Classroom Evaluation to Students'
Self-Concepts and Mental Health. Unpublished Ph.D. Dissertation,
University of Chicago, 1969.



